DOCUMENT RESUME

ED 199 299                                        TM 810 339

AUTHOR         Anderson, Ronald E.; And Others
TITLE          Methodological Considerations in the Development of
               Indicators of Achievement in the NAEP Data.
INSTITUTION    Minnesota Univ., Minneapolis. Center for Social
               Research.
SPONS AGENCY   National Science Foundation, Washington, D.C.
PUB DATE       [81]
GRANT          NSF-SED-79-17259
NOTE           29p.

EDRS PRICE     MF01/PC02 Plus Postage.
DESCRIPTORS    *Achievement Rating; *Data Analysis; Educational
               Assessment; Elementary Secondary Education; National
               Surveys; *Research Design; *Research Methodology;
               *Research Problems
IDENTIFIERS    *Data Interpretation; *National Assessment of
               Educational Progress

ABSTRACT
     The organization of data at the National Assessment
of Educational Progress (NAEP) is undergoing a significant transition
from a system designed only for national assessment purposes to one
designed both for assessment and a variety of academic research
interests. The advent of NAEP public-use data files opens up many
possibilities for those who have the skills, time, and resources to
do secondary analysis. An analysis of the mathematics test items is
presented which demonstrates alternate procedures for developing
indicators of mathematics achievement. This analysis demonstrates
that the NAEP item subsets will not always meet conventional
psychometric criteria. This failure to meet standard achievement test
criteria does not mean that secondary analysis of the data is unwise.
It does imply, however, that interpretation of findings, especially
those using subtests, must be made cautiously. Limitations of the
methodology must be acknowledged. Conventional achievement testing is
not item-centered like assessment testing. The measurement priority
of assessment is stability across multiple testings, not relative
comparisons among persons. Consequently, standards of item
discrimination and construct validity have obviously less import. Of
far greater importance for assessments are standards of face
validity, content validity, internal consistency and the application
of rigorous data analysis techniques. (Author)
Reproductions supplied by EDRS are the best that can be made
from the original document.




METHODOLOGICAL CONSIDERATIONS IN THE DEVELOPMENT
OF INDICATORS OF ACHIEVEMENT IN THE NAEP DATA



DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)

This document has been reproduced as received from the person or
organization originating it.

Minor changes have been made to improve reproduction quality.

Points of view or opinions stated in this document do not
necessarily represent official NIE position or policy.



Ronald E. Anderson 
Wayne W. Welch 
Linda J. Harris 



"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY 



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC)." 



Minnesota Center for Social Research 
2122 Riverside Avenue 
University of Minnesota 
Minneapolis, Minnesota 55454 



This article was prepared with the support of 
National Science Foundation Grant No. SED 79-17259. 
Any opinions, findings, conclusions or recommendations
expressed are those of the authors and do not necessarily
reflect the views of the National Science Foundation
or anyone else.






METHODOLOGICAL CONSIDERATIONS IN THE DEVELOPMENT
OF INDICATORS OF ACHIEVEMENT IN THE NAEP DATA



ABSTRACT 



The organization of data at the National Assessment of Educational
Progress (NAEP) is undergoing a significant transition from a system
designed only for national assessment purposes to one designed both for
assessment and a variety of academic research interests. Those
acquiring NAEP data for secondary analysis must be aware of the
organization of the data due to its historical emphasis upon assessment.
Researchers should be cognizant of the unique features of NAEP data as
well as potential sources of error in using the data. In its ongoing
analysis of the thirty files from the 1977-78 assessment of mathematics,
the research group at the University of Minnesota has identified and
summarized some of these positive and negative features. In this report
we review alternate procedures for evaluating measures of achievement at
the level of the individual data package. Some findings are presented
and their methodological implications discussed.




INTRODUCTION 

The National Assessment of Educational Progress (NAEP) and its
findings have been extensively discussed in both the popular press and
the academic literature. As described by Wright, Larsson, and Ramlow
(1981), the management of NAEP data has recently been reorganized in
order to facilitate a wide variety of secondary analyses. Specifically,
NAEP data are now disseminated as a series of public-use data tapes
where each data file contains the results of one assessment booklet
(package) for one age group during one year or cycle of assessment. As
discussed in an earlier paper (Anderson, Welch, Harris, 1980), this
organization of the data has some serious implications for the secondary
analyst. In this paper we are concerned only with the problems of
identifying and developing indicators of achievement.

Tyler (1970), Womer (1973) and others have elaborated on the
differences between assessment and testing, especially standardized
achievement testing. The main difference is that standardized testing
seeks to compare and rank student scores whereas assessment ignores
individual scores and reports on the state of the system and its major
social groups. Millman (1978) clarifies this distinction further by
contrasting three major types of assessment models: (1) item-centered,
where many test items are used to assess a large content domain, (2)
objective-centered, where each objective is represented by a number,
e.g., 5, of test items, and (3) subtest-centered, where the focus is
upon several subtests, each of which is measured with a moderate number
of items. NAEP has always been item-centered in its approach, but the
new public-use data system fosters a subtest-centered approach. With
several hundred test exercises (items), NAEP assesses a content domain
very extensively, but specific subdomains and their associated subtests
may not be thoroughly covered. This problem is magnified when the
analysis becomes package-centered, as with the public-use data tapes.
Not only may some subdomains be very underrepresented, but the overall
domain may not be well represented by the limited set of items within a
single booklet/package.

It is quite reasonable to expect many secondary data analysts to be
subtest-centered in their analysis interests. For instance, a
researcher might want to identify correlates of "spatial reasoning".
Spatial reasoning is a subset of the NAEP "shapes, sizes, and
relationships" subdomain, which is a subset of the mathematics domain.
NAEP, in their data analysis system, can analyze the relationship
between a subtest and standard reporting variables by aggregating across
all packages. This is much more difficult for a secondary analyst of
public-use tapes; separate processing of each of ten to thirty files is
necessary. Further, many interesting analyses are not possible this way
because many independent as well as dependent variables are included in
only one or two packages. For many potentially interesting analyses,
the secondary analyst must settle for analysis of one and only one
package. Before proceeding with such analysis, the methodologically
cautious researcher must check for the validity and reliability of the
package's specific indicators. Thus the secondary analyst cannot, as
does NAEP, depend upon the large bank of test items utilized in an
item-centered assessment. At the package-specific level of analysis,
where only a small portion of the content domain is included, the
analyst is obligated to substantiate the psychometric soundness of the
measures available. One exception to this is the use of single
achievement items for analytic purposes. For instance, in error
analysis one might look at the frequency of occurrence of specific
incorrect responses to a specific item stem.

In the past, NAEP has deemphasized or avoided psychometric
techniques such as item analysis, factor analysis and the selection of
items on the basis of discrimination. In addition, NAEP generally does
not compute estimates of reliability; however, a dissertation by Mullis
(1978) on the 1976 NAEP supplementary assessments of mathematics and
political knowledge does report reliabilities for achievement
exercises. In Mullis' analysis the reliabilities averaged 0.91 for
mathematics but 0.71 for political knowledge.

Since our analysis and most future secondary analysis of NAEP data
will necessarily be data package oriented, reliability and validity
analyses are important for the refinement of indicators. At the data
package level one is limited to a relatively small set of achievement
items, especially if one needs scale or subtest scores to examine
specific subdomains. Since the number of items per content area varies
considerably across data packages, the content validity and reliability
of indicators must be evaluated separately for each data set.

In recommending psychometric analyses of NAEP items at the package
level, we are not implicitly criticizing NAEP methodology. The primary
goal of NAEP is to measure changes in educational performance and to
report these to the nation for purposes of policy formulation.
Consequently, there has been little need to compute total achievement
scores. In fact, until recently NAEP has avoided the use of the term
"achievement" in its various reports of progress or performance of the
nation. In secondary analysis the goals may change; thus the methods
must change to meet the corresponding requirements that ensure the
highest quality of an analysis.

EVALUATING PACKAGE-SPECIFIC INDICATORS OF ACHIEVEMENT 

On the basis of our analysis of major sources of error in NAEP data
(Anderson, Welch, Harris, 1980) and using the recommendations of the
conventional wisdom in educational testing (cf. Lord and Novick, 1968;
Guilford, 1954; Nunnally, 1978), a set of procedures has been outlined.
These procedures are listed in Figure 1 and are designed to guide the
evaluation of NAEP achievement indicators.

We applied these procedures to three data packages from the 1977-78
NAEP mathematics assessment of 17-year-olds. While the next section is
limited to these mathematics data, the problems and procedures are
relevant to other content areas as well.
(1) CONTENT VALIDITY 

The planning for the 1977-78 NAEP mathematics assessment was guided
by a two-dimensional classification of the content domain (Figure 2).
One dimension includes five subject-matter content areas and the other
four cognitive process levels. These categories are defined in the
Mathematics Objectives book (National Assessment of Educational
Progress, 1978) as follows:
(A.) Numbers and Numeration 

This category contains the largest number of exercises because of
its importance in the curriculum. Exercises deal with the way numbers
are used, processed or written. Knowledge and understanding of
numeration and number concepts are assessed for whole numbers,
fractions, decimals, integers and percents, with considerable emphasis
placed on operations. Number properties and order relations are also
included. Problem-solving exercises include routine number problems,
nonroutine problems and consumer problems. Nonroutine problems are
exercises not normally encountered in the curriculum, but understandable
to the age group. Consumer problems deal primarily with the uses of
mathematics in commercial situations (for example, buying and selling,
interpreting graphs and saving money) and are emphasized more at the
17-year-old level than at the two younger age levels.

(B.) Variables and Relationships

The use of variables and relationships corresponds to an important
part of the school mathematics curriculum. The exercises for this
content category deal with facts, definitions and symbols of algebra;
the use of variables in equations and inequalities; the use of variables
to represent elements of a number system; and exponential and
trigonometric functions. There are very few exercises appropriate for
9-year-olds in this category, and only a few topics are appropriate for
13-year-olds. Many more are appropriate at the 17-year-old level.

(C.) Shape, Size and Position

The exercises in this content category measure objectives related
to school geometry. The emphasis in the assessment is not on geometry
as a formal deductive system. The exercises concern plane and solid
shapes, congruence, similarity, properties of triangles, properties of
quadrilaterals, constructions, sections of solids, other basic theorems
and relationships, rotations and symmetry.

(D.) Measurement

A portion of the assessment is devoted to measurement, reflecting
increased emphasis on measurement in the school curriculum. The
exercises cover appropriate units; equivalence relations; instrument
reading; length, weight, capacity, time and temperature; perimeter, area
and volume; non-standard units; and precision and interpolation. A
substantial number of the measurement exercises require the use of
metric units.
(E.) Other Topics

Other mathematical content topics included in this assessment at 
all age levels are probability and statistics; graphs, tables and 
charts; and logic. Special assessment exercises and procedures have 
been developed to assess attitudes related to mathematics, computer 
literacy and the use of the hand calculator. 

The four cognitive processes identified in the objectives plan were:
(I.) Mathematical Knowledge

Mathematical knowledge refers to the recall and recognition of
mathematical ideas expressed in words, symbols or figures. Mathematical
knowledge relies, for the most part, on memory processes. It does not
ordinarily require other more complex mental processes.

Exercises that assess mathematical knowledge require that a person
recall or recognize one or more items of information. An example of an
exercise involving recall would be one that asks for a multiplication
fact such as the product of five and two.
(II.) Mathematical Skill

Mathematical skill refers to the routine manipulation of
mathematical ideas. Mathematical skill relies on algorithmic processes.
An algorithm is a standard procedure that always leads to an answer.
Mathematical skill requires the recollection of how to use the
algorithm.



(III.) Mathematical Understanding

Mathematical understanding refers to the explanation and
interpretation of mathematical knowledge. Mathematical understanding
relies primarily on translation processes. The mathematical knowledge
can be expressed in words, symbols or figures; and the translation may
be within or between any of these modes of expression. Mathematical
understanding involves memory processes as well as processes of
associating one item of knowledge with another.
(IV.) Mathematical Application

Mathematical application refers to the use of mathematical
knowledge, skill and understanding. Mathematical application relies on
memory, algorithmic, translation and judgment processes.

Exercises that assess mathematical application require a sequence
of processes that relate to the formulation, solution and interpretation
of problems. The processes may include recalling and recoding
knowledge, selecting and carrying out algorithms, making and testing
conjectures, and evaluating arguments and proportion, or it might require
the demonstration that two geometric figures are congruent.

The foregoing taxonomy of goals for mathematics education is
basically consistent with those of others who have attempted to classify
the goals of mathematics education. Begle (1979) reviews some of these
discussions and elaborates upon the classification developed for the
National Longitudinal Study of Mathematics. Their categorization of
objectives by cognitive level and content is quite similar to that of
NAEP's second assessment (Figure 2). In planning for the 1977-1978
assessment NAEP developed a "blueprint" defining the relative emphasis
of each category of objectives for each age group (Figure 3). Each cell
of the blueprint table specified the recommended number of exercises
for the corresponding category or subdomain for a particular age group.
Although the blueprint was used as a guide for the design of the tests,
the actual distribution of exercises departs somewhat, as can be seen in
Table 1, which gives the booklet distributions for age 17. The
proportion of items for a specific category varies considerably across
booklets. For instance, the blueprint specified that 24% of the items
should be "understanding" items, but one package contained no such items
and the highest contained 23%, with an average across packages of 16%.
The main reason for the overall departure from the blueprint was that
items at the "understanding" level presumably required longer than
average completion times. NAEP weights the items by timing factors
before determining how many items should be selected for a given
category.
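The 24% figure can be checked against the blueprint. The sketch below is an illustration only: it takes the age-17 process counts as we read them from Figure 3 (55, 110, 105 and 160 exercises) and converts them to percentage shares. The function name `blueprint_shares` is hypothetical, and NAEP's timing-factor weights are not modeled.

```python
# Recommended number of age-17 exercises per cognitive process, as read
# from Figure 3 (the figures are approximate in the source).
BLUEPRINT_AGE_17 = {
    "knowledge": 55,
    "skill": 110,
    "understanding": 105,
    "application": 160,
}


def blueprint_shares(counts):
    """Return each category's share of the blueprint total, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = blueprint_shares(BLUEPRINT_AGE_17)
for name, pct in shares.items():
    # Print one line per process level, e.g. "understanding   24.4%".
    print(f"{name:14s} {pct:5.1f}%")
```

Under this reading "understanding" accounts for 105 of 430 exercises, which rounds to the 24% cited in the text.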

Except for specific circumstances where a booklet is designed to be
used in conjunction with a calculator or "handout", e.g., ruler, the
items are randomly selected for a booklet. While this is advantageous
from a matrix sampling point of view, it means that the content validity
must be checked for each data set an investigator contemplates using.
Some booklets are strong in one content domain and other booklets in
another. With the exception of booklet 12C, which was designed for a
calculator study, the booklets have at least a few items for each
content and process category. This contributes to the content validity
of the measurement of overall mathematics achievement. It also implies
that researchers must be selective if their interests require
measurement of specific subsets of this domain.
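A content validity check of this kind amounts to counting a package's items in each cell of the objectives matrix. The following is a minimal sketch, assuming each item has already been classified by content area (A to E) and process level; the toy classification and the function names are hypothetical and do not come from any NAEP file.

```python
from collections import Counter
from itertools import product

CONTENT = ["A", "B", "C", "D", "E"]          # content areas of Figure 2
PROCESS = ["knowledge", "skill", "understanding", "application"]


def cell_counts(item_cells):
    """Count items per (content, process) cell, including empty cells."""
    counts = Counter(item_cells)
    return {cell: counts.get(cell, 0) for cell in product(CONTENT, PROCESS)}


def empty_cells(counts):
    """Return the cells of the objectives matrix that have no items."""
    return [cell for cell, n in counts.items() if n == 0]


# Hypothetical classification of five items from one data package.
toy_items = [
    ("A", "skill"), ("A", "skill"), ("A", "knowledge"),
    ("C", "application"), ("D", "understanding"),
]
counts = cell_counts(toy_items)
print(counts[("A", "skill")])        # items in the Numbers x Skill cell
print(len(empty_cells(counts)))      # cells with no coverage at all
```

Cells with few or no items signal subdomains for which the package cannot support a subtest.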

(2) FACE VALIDITY 



The second step recommended is to review the individual items for
face validity. We found some questions to be potentially ambiguous in
meaning, for instance:

The Thompsons' dinner bill totaled
$28.75. Mr. Thompson wants to leave
a tip of about 13%. About how much
should he leave for the tip?

If the student interprets the phrase "About how much" literally, then
a rough guess is an acceptable answer. Apparently the NAEP scorers
interpreted the question more restrictively, as only 23% of the
17-year-old students got this question "right."

Another exercise had semantic problems: 

An advertisement for a sale indicates that all merchandise
has been reduced by 40 percent of its regular price. If
the sale price of a washer is advertised at $144, what was
its regular price before the sale?

The phrase "reduced by 40 percent" is not commonly used in retailing;
instead the common terminology is "40% off" or "reduced 40%." While
not everyone may agree with our interpretation, the item was either very
confusing or very difficult because only 4% got the correct answer.
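The arithmetic behind the washer item helps show why it is hard. The sketch below is our own illustration, not part of the NAEP scoring materials: it contrasts the reading we presume was keyed (the sale price is 60% of the regular price) with one plausible misreading (adding 40% back onto the sale price). Both function names are hypothetical.

```python
def regular_price(sale_price, fraction_off):
    """Invert sale = regular * (1 - fraction_off) to recover the regular price."""
    return sale_price / (1.0 - fraction_off)


def naive_markup(sale_price, fraction_off):
    """A plausible misreading: add the percentage back onto the sale price."""
    return sale_price * (1.0 + fraction_off)


# Washer item: sale price of $144 after a 40 percent reduction.
keyed = round(regular_price(144.0, 0.40), 2)
misread = round(naive_markup(144.0, 0.40), 2)
print(keyed, misread)
```

Under the first reading the regular price is $240; the misreading gives $201.60, so a student who misparses the wording will miss the item even with sound computation.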

The seriousness of this problem may be less than is suggested here. 
As described by Holmes (1980), NAEP has followed unusually careful 
procedures to guard against any kind of hidden bias or ambiguity in the 
language of their test items. Nevertheless, we recommend that
secondary analysts examine the specific items they incorporate into
subtests. If any items appear questionable, they are candidates for
deletion.

(3) BAD CASE ANALYSIS 

The third step calls for an additional quality review of the data.
Specifically, cases suspected to have considerable missing information
should be examined. In checking these problematic cases we found only
one or two cases with seriously large blocks of missing data. For
purposes of completeness we left these cases in the working data set.
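The screening rule for bad cases given in Figure 1 can be sketched as follows. This is an illustration under stated assumptions: the record layout, the use of `None` for a missing response, and the PKGCON values treated as problematic are all hypothetical, and the actual codes must be taken from the public-use tape documentation.

```python
# Hypothetical PKGCON values treated as "problematic" (partial completion).
PROBLEM_PKGCON = {2, 3}


def flag_bad_cases(cases):
    """List cases to inspect: zero composite score or problematic PKGCON."""
    return [c for c in cases
            if c["composite"] == 0 or c["pkgcon"] in PROBLEM_PKGCON]


def should_reject(case):
    """Reject only on total nonresponse to math or to background items.

    None marks a missing response (an assumption about the coding).
    """
    no_math = all(r is None for r in case["math"])
    no_background = all(r is None for r in case["background"])
    return no_math or no_background


# Three hypothetical respondent records.
cases = [
    {"id": 1, "composite": 31, "pkgcon": 1,
     "math": [1, 0, 1], "background": [2, 1]},
    {"id": 2, "composite": 0, "pkgcon": 1,
     "math": [None, None, None], "background": [1, 2]},
    {"id": 3, "composite": 12, "pkgcon": 3,
     "math": [1, None, 0], "background": [None, 1]},
]
flagged = flag_bad_cases(cases)
rejected = [c["id"] for c in flagged if should_reject(c)]
print([c["id"] for c in flagged], rejected)
```

Listing the flagged cases before rejecting any of them keeps the decision visible; in our own analysis the few problematic cases were retained.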
(4) CONSTRUCT VALIDITY 

The construct properties of mathematics achievement were explored
by first pooling the three data sets and then computing the subtest
intercorrelations (Table 2). In general the intercorrelations are very
high; most are greater than 0.70. This structure implies that the
subdomains represented by the subtests are not highly distinct from the
global construct, mathematics achievement. It may be that the cognitive
capacities required for one subdomain are quite similar to those
required by another, or by mathematics in general.

To explore this question further we performed a series of factor
analyses on the MATS10 data set. Both the alpha and the principal
component methods of factor extraction were applied to the complete set
of items. The results were not particularly useful because the
resulting factors appeared to be mainly a function of the multi-part
exercise structure in NAEP booklets. A number of exercises,
particularly in mathematics, have several subquestions which NAEP calls
"parts". For instance, one page might have three simple division
problems requiring three different answers, but they are identified as a
single exercise. Because the parts tend to be highly similar to each
other and in some cases build upon the preceding part, the
intercorrelations tend to be unusually high. These clusters of
correlations produce a sufficiently large amount of common variance that
each is often extracted as a single factor. This resulted in a great
many factors and no meaningful structures. In order to avoid the
multi-part exercise problem, we randomly selected one part from each of
the multi-part questions. This reduced the item set to 31 items. A
principal components, varimax-rotated factor matrix of these items is
presented in Table 3. Four factors explaining 37% of the total variance
were produced, although the first factor accounted for most of that
variance. Table 4 was assembled to provide assistance in the
interpretation of the factor solution. Whenever an item had a factor
loading greater than 0.30, it was listed with that factor and the
symbols for its content and process categories were entered adjacent to
it in the table. The symbols are simply the first letter of each
category label, e.g., N for Numbers; A for Application; S for Shape; V
for Variables; M for Measurement; U for Understanding; and K for
Knowledge and Skills. What is surprising is that both before and after
reducing the items, no discernible structure is evident. The only
clustering of items roughly representing a single cell in the objectives
matrix is factor A, where half of the items are a combination of Numbers
and Knowledge/Skill. The other factors have an even less defined
conceptual character.

When we ignored the prior assignment of items to categories of 
objectives, we still did not find clearly meaningful clusters of items 
associated with specific factors. The results suggest that learning in 
mathematics is probably not as segmented and multi-faceted as the 
definition of the content domain suggests. 

They also suggest that secondary analysts must proceed with great
caution in constructing subtests from NAEP data packages. Even if the
subtest reliabilities are adequate, the items may not constitute a
homogeneous, independent cluster in all respects.
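Two of the steps described above can be sketched in code: drawing one part at random from each multi-part exercise, and measuring how much variance the first principal component accounts for. The sketch is an illustration only and deliberately simpler than our analysis: it uses power iteration for the largest eigenvalue and omits alpha factoring and varimax rotation. The exercise identifiers, the toy data and all function names are hypothetical.

```python
import random
from math import sqrt


def pick_one_part(exercise_parts, seed=0):
    """Randomly keep one part (item id) from each exercise."""
    rng = random.Random(seed)
    return [rng.choice(parts) for parts in exercise_parts.values()]


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(data, items):
    """Item intercorrelations; data maps item id -> list of 0/1 scores."""
    return [[pearson(data[a], data[b]) for b in items] for a in items]


def first_component_share(corr, iterations=200):
    """Share of total variance on the first principal component.

    Power iteration estimates the largest eigenvalue; dividing by the
    number of items gives the proportion of variance. Assumes mostly
    positive correlations so the all-ones start vector is adequate.
    """
    k = len(corr)
    v = [1.0] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(corr[i][j] * v[j] for j in range(k))
              for i in range(k))
    return lam / k


# Toy data: eight respondents of increasing ability; an item is passed
# when ability reaches the item's (hypothetical) difficulty threshold.
exercise_parts = {"EX1": ["EX1a", "EX1b", "EX1c"],
                  "EX2": ["EX2a"],
                  "EX3": ["EX3a", "EX3b"]}
thresholds = {"EX1a": 1, "EX1b": 2, "EX1c": 3,
              "EX2a": 4, "EX3a": 5, "EX3b": 6}
data = {item: [1 if ability >= t else 0 for ability in range(8)]
        for item, t in thresholds.items()}

items = pick_one_part(exercise_parts, seed=1)
share = first_component_share(correlation_matrix(data, items))
print(items, round(share, 3))
```

Selecting one part per exercise removes the within-exercise clusters that would otherwise dominate the extraction.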






(5) ITEM ANALYSIS

The reliability of a test indicates how free it is from random
error and certain types of indicator bias. In our analysis of
reliability we applied the widely used coefficient of internal
consistency, Cronbach's alpha. Reliability coefficients were computed
for all the scales corresponding to all nine of the content/process
categories. These results, along with the number of items for each
scale and the average proportion correct, are given in Tables 5-7 for
each of the three data sets. In these tables the "complete" tests
include all of the available items whereas the "refined" tests have a
few items deleted according to criteria which will be explained later.
The reliability levels of the composite or total test are above 0.93 in
every instance. This level of reliability is quite satisfactory and is
consistent with the findings of Mullis (1978) from the 1976
supplementary mathematics assessment. The shorter scales or subtests
not surprisingly have lower reliability levels and these vary from
package to package. They are most sensitive to differences in the number
of relevant items in a package. For example, data set MATS01 has only 4
shape (geometry) items, which have an alpha of only 0.36, whereas MATS03
has 14 shape items and they have a reliability of 0.84. Nunnally (1978)
recommends that reliabilities of tests be at least 0.70 for "exploratory
research". Using this criterion we find an acceptable reliability for
each test and subtest in at least one data set with the exception of the
"measurement" scale and the "other topics" scale. There are typically
only a few items for each of these two areas and, perhaps most
importantly, their respective domains are not defined as homogeneous,
unified categories.
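For readers who wish to reproduce such coefficients, a minimal sketch of Cronbach's alpha follows. It uses the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total score) for k items scored 0/1; the toy response matrices and the function name are hypothetical and are not NAEP data.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    k = len(rows[0])
    # Variance of each item across respondents.
    item_vars = [pvariance([row[j] for row in rows]) for j in range(k)]
    # Variance of the total (sum) score across respondents.
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


# Perfectly consistent toy items: every respondent passes all or none.
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
# A noisier toy subtest of four items.
noisy = [
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
alpha_consistent = cronbach_alpha(consistent)
alpha_noisy = cronbach_alpha(noisy)
print(alpha_consistent, round(alpha_noisy, 2))
```

Because alpha rises with the number of items, a four-item subtest will usually fall well below a fourteen-item subtest measuring the same content, as the shape scales illustrate.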






The deletion of items for the "refined" tests was based upon
traditional item analysis techniques. Specifically, items were dropped
if they had either (1) a point-biserial correlation of less than 0.30,
or (2) an extreme p-value, i.e., less than 10% or more than 90% answered
the item correctly. As can be seen in Tables 5, 6 and 7, the
reliabilities of the "refined" tests are roughly the same as the
"complete" tests, even though the number of items is generally reduced.
While it might seem desirable to use the reduced or refined version of
the tests, there are strong arguments in favor of using the "complete"
tests. The philosophy of assessment argues that the full range of
ability, not just the average, should be tested (Tyler, 1970). If
criteria of discrimination, i.e., high biserial correlation, are
applied, then items which are extreme tend to be eliminated. Since the
objective of assessment is to measure total performance of all students,
very easy and very difficult items should be included as long as they
reside within the definition of the domain of content. Thus, while the
"refined" test is more efficient, the "complete" test is probably more
valid in that it represents the domain more fully. Our recommendation is
that "complete" tests or subtests be utilized in secondary analysis
except when the reliability is low.
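The two deletion criteria can be sketched as follows. This is an illustration under assumptions: the point-biserial is computed as the Pearson correlation between the 0/1 item and the total score with the item itself included (an uncorrected item-total correlation, which may differ from the variant used for Tables 5-7), and the toy data and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; for a 0/1 item this is the point-biserial."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen_items(rows, min_r=0.30, p_low=0.10, p_high=0.90):
    """Return (item index, reason) for items failing either criterion."""
    n, k = len(rows), len(rows[0])
    totals = [sum(row) for row in rows]
    dropped = []
    for j in range(k):
        item = [row[j] for row in rows]
        p = sum(item) / n                      # proportion correct
        if p < p_low or p > p_high:
            dropped.append((j, "extreme p-value"))
        elif pearson(item, totals) < min_r:
            dropped.append((j, "low point-biserial"))
    return dropped


# Ten hypothetical respondents by four items.
rows = [
    [1, 1, 0, 1], [1, 1, 0, 1], [1, 1, 0, 1], [1, 1, 0, 1], [1, 1, 1, 0],
    [0, 1, 0, 1], [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 0],
]
dropped = screen_items(rows)
print(dropped)
```

Here the second item is answered correctly by everyone and the third correlates negatively with the total, so both would be removed from a "refined" test while remaining in the "complete" test.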
(6) EXAMINE SHAPE OF DISTRIBUTION

The final step in our recommended procedures specifies the
construction of a total score for achievement and a visual examination
of its distribution within the sample. Nunnally (1978) stresses the
importance of this activity for the purpose of identifying whether the
distribution is symmetrical or skewed. A sample
distribution from one data set is displayed in Figure 4. It reveals
that the distribution is quite symmetrical and has only a small upward
skew. The average percent correct on this set of items was

SUMMARY 

The advent of NAEP public-use data files opens up a wealth of
possibilities for those who have the skills, time, and resources to do
secondary analysis. Our analysis of the mathematics test items has
demonstrated the potential for developing indicators of mathematics
achievement. This analysis has also demonstrated that the NAEP item
subsets will not always meet conventional psychometric criteria. This
failure to meet standard achievement test criteria does not mean that
secondary analysis of the data is unwise. But it does imply that
interpretation of findings, especially those using subtests, must be
made cautiously. Limitations of the methodology must be acknowledged.

Conventional achievement testing is not item-centered like
assessment testing. The measurement priority of assessment is stability
across multiple testings, not relative comparisons among persons.
Consequently, standards of item discrimination and construct validity
have obviously less import. Of far greater importance for assessments
are standards of face validity, content validity, internal consistency
and the application of rigorous data analysis techniques. Our
recommended procedures should be followed any time a secondary analyst
seeks to utilize a package-level test or subtest of performance or
achievement. Prudent methodology will ensure the discovery of the
substantive potential buried in the NAEP data base.



BIBLIOGRAPHY 






American Psychological Association. Standards for Educational and
Psychological Tests. Washington, D.C.: American Psychological
Association. (1966)

Anderson, R.E., Welch, W., and L. Harris. "Methodological Problems in
Secondary Analysis of Data from the National Assessment for Educational
Progress". Minnesota Center for Social Research, University of
Minnesota, Minneapolis, MN. (1980)

Begle, E.G. Critical Variables in Mathematics Education. Mathematical
Association of America and the National Council of Teachers of
Mathematics, Washington, D.C. (1979)

Guilford, J.P. Psychometric Methods. New York: McGraw-Hill Book
Company, Inc. (1954)

Holmes, B.J. "Bias: Psychometric and Social Implications for the
National Assessment for Educational Progress". Denver, Colorado:
Education Commission of the States. (1980)

Lord, F.M. and M.R. Novick. Statistical Theories of Mental Test Scores.
Reading, Mass.: Addison-Wesley. (1968)

Millman, Jason. "Strategies for Constructing Criterion-Referenced
Assessment Instruments". Cornell University. (1978)

Mullis, I. Effects of Home and School on Learning Mathematics and
Political Knowledge and Attitudes. University of Colorado, PhD
Dissertation, Boulder, CO. (1978)

National Assessment of Educational Progress. Mathematics Objectives:
Second Assessment. Denver, CO: Education Commission of the States.
(1978)

National Assessment of Educational Progress. [...]
1977-78 Mathematics Assessment. Education Commission of the States,
Suite 700, 1860 Lincoln Street, Denver, Colorado, 80295. (1980)

Nunnally, J.C. Psychometric Theory, Second Edition. New York:
McGraw-Hill. (1978)

Tyler, Ralph W. "National Assessment: A History and Sociology." School
and Society, December, 1970, pp. 471-77.

Womer, F.B. Developing a Large Scale Assessment Program. Denver, CO:
Cooperative Accountability Project, 1973.

Wright, D.J., Larsson, I.E., Ramlow, C.E. "National Assessment
Public-Use Tapes." Denver, CO: Education Commission of the States, 1981.






FIGURE 1 



PROCEDURES FOR EVALUATING ACHIEVEMENT INDICATORS 



Perform content validity analyses by examining domain objectives 
and number of data package items falling into each cell of the 
objectives categorization to determine if all subdomains are 
adequately represented. 

Perform face validity analysis on all exercises and items by 
examining the questions and the response distributions to 
determine if any items are seriously questionable. 

Check for bad cases by listing those cases 

     with a composite score of zero, or 
     those possessing a problematic value on the package 
     condition code (PKGCON), which generally indicates 
     partial completion. If the listing reveals either 
     total nonresponse on math exercises or total 
     nonresponse on background items, then reject that 
     particular case. 

Perform factor analysis of items to evaluate construct validity 
of items. 

Perform item analysis to obtain reliability estimates and to 
identify contributions of individual items. 

Produce and examine histograms of achievement scores for the 
complete set of exercises to evaluate properties of the 
distribution including number of extreme scores. 
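
The bad-case check described in Figure 1 can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' program: the record layout, the field names (`pkgcon`, `math_responses`, `background_responses`), and the set of PKGCON values treated as problematic are all hypothetical, because the figure does not list the actual codes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical PKGCON values treated as problematic. The actual codes are
# defined in the NAEP public-use tape documentation and are not given here.
PROBLEM_PKGCON = frozenset({2, 3})


@dataclass
class Case:
    case_id: str
    pkgcon: int
    # 1 = correct, 0 = incorrect, None = no response
    math_responses: List[Optional[int]]
    # None = no response
    background_responses: List[Optional[str]]

    @property
    def composite(self) -> int:
        """Number of math exercises answered correctly."""
        return sum(r for r in self.math_responses if r is not None)


def is_listed(case: Case) -> bool:
    """List a case if its composite score is zero or its package
    condition code is problematic."""
    return case.composite == 0 or case.pkgcon in PROBLEM_PKGCON


def is_rejected(case: Case) -> bool:
    """Reject a listed case only on total nonresponse to either the
    math exercises or the background items."""
    no_math = all(r is None for r in case.math_responses)
    no_background = all(r is None for r in case.background_responses)
    return is_listed(case) and (no_math or no_background)


def screen(cases: List[Case]) -> Tuple[List[Case], List[Case]]:
    """Split cases into (kept, rejected)."""
    kept = [c for c in cases if not is_rejected(c)]
    rejected = [c for c in cases if is_rejected(c)]
    return kept, rejected


cases = [
    Case("a", 1, [1, 0, 1], ["x", "y"]),           # ordinary case
    Case("b", 1, [0, 0, 0], ["x", "y"]),           # zero score, but responded
    Case("c", 1, [None, None, None], ["x", "y"]),  # total math nonresponse
    Case("d", 2, [1, 1, 0], [None, None]),         # flagged code, no background
]
kept, rejected = screen(cases)
print([c.case_id for c in kept])      # ['a', 'b']
print([c.case_id for c in rejected])  # ['c', 'd']
```

Note that case "b" is listed (its composite score is zero) but retained, since it answered every exercise; only total nonresponse leads to rejection in this sketch.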






FIGURE 2 



FRAMEWORK FOR OBJECTIVES, 1977-78 MATHEMATICS ASSESSMENT 



CONTENT (columns) 

     A.  Numbers and Numeration 
     B.  Variables and Relationships 
     C.  Shape, Size and Position 
     D.  Measurement 
     E.  Other Topics 

PROCESS (rows) 

     I.    Mathematical knowledge 
     II.   Mathematical skill 
     III.  Mathematical understanding 
     IV.   Mathematical application 






FIGURE 3 



BLUEPRINT DEFINING RELATIVE SIZE OF CONTENT AND PROCESS 
CATEGORIES OF MATHEMATICS 
OBJECTIVES* 



Approximate Number of Exercises 
by Age and Content 

                                   Age 9    Age 13    Age 17 

A.  Numbers and numeration          110       150       160 
B.  Variables and relationships      20        40        90 
C.  Shape, size and position         30        60        70 
D.  Measurement                      40        50        60 
E.  Other topics                     30       [?]        50 



Approximate Number of Exercises 
by Age and Process 

                                   Age 9    Age 13    Age 17 

Mathematical knowledge               45        45        55 
Mathematical skill                   65       [?]       110 
Mathematical understanding           60       105       105 
Mathematical application             60       115       160 
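
The blueprint counts in Figure 3 can be converted to percentage shares, which is the form in which the blueprint appears in Table 1. The sketch below does this for Age 17. The counts are as read from the scanned figure (several digits are hard to read, so treat them as assumptions), and the function and variable names are illustrative only.

```python
# Age 17 blueprint counts as read from Figure 3 (assumed readings).
AGE17_CONTENT = {
    "Numbers and numeration": 160,
    "Variables and relationships": 90,
    "Shape, size and position": 70,
    "Measurement": 60,
    "Other topics": 50,
}
AGE17_PROCESS = {
    "Knowledge and skills": 55 + 110,  # knowledge and skill combined
    "Understanding": 105,
    "Application": 160,
}


def percent_shares(counts):
    """Return each category's share of the total, as a whole percent."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


print(sum(AGE17_CONTENT.values()))    # 430
print(percent_shares(AGE17_CONTENT))
print(percent_shares(AGE17_PROCESS))
```

Under these readings both classifications total 430 exercises, and the shares come out to 37, 21, 16, 14 and 12 percent for content and 38, 24 and 37 percent for process, which agrees with the Blueprint column of Table 1.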






Figure 4 



DISTRIBUTION OF MATHEMATICS ACHIEVEMENT TOTAL SCORE 
(COMPLETE TEST OF 58 ITEMS) FOR 17 YEAR OLDS IN 
1977-78 NAEP DATA SET MATS10 (N = 229[?]) 



SCORE   COUNT 

  57       6 
  56       7 
  55       9 
  54     [?] 
  53      19 
  52      20 
  51      21 
  50     [?] 
  49     [?] 
  48     [?] 
  47      27 
  46      29 
  45      26 
  44     [?] 
  43      26 
  42      34 
  41     [?] 
  40      32 
  39     [?] 
  38      41 
  37     [?] 
  36      35 
  35     [?] 
  34      34 
  33      36 
  32      31 
  31      25 
  30     [?] 
  29     [?] 
  28      25 
  27      20 
  26      30 
  25     [?] 
  24      21 
  23      22 
  22      23 
  21      15 
  20      13 
  19      12 
  18       7 
  17       8 
  16       8 
  15      11 
  14       6 
  13       7 
  12     [?] 
  11       1 
  10       2 
   9       2 
   8       1 
   7       2 
   6       3 
   5       0 
   4       1 
   3       0 
   2       0 
   1       1 
   0       4 
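
A score-by-count display of the kind shown in Figure 4 can be produced with a few lines of Python. This is a sketch only: the function names and the toy scores are illustrative and are not taken from the NAEP files.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def text_histogram(scores: Sequence[int], max_score: int,
                   symbol: str = "*") -> List[str]:
    """One line per possible score, highest first: score, count, bar."""
    counts = Counter(scores)
    lines = []
    for score in range(max_score, -1, -1):
        n = counts.get(score, 0)
        lines.append(f"{score:>3} {n:>4} {symbol * n}".rstrip())
    return lines


def extreme_counts(scores: Sequence[int], max_score: int) -> Tuple[int, int]:
    """Number of cases at the floor (0) and at the ceiling (max_score)."""
    counts = Counter(scores)
    return counts.get(0, 0), counts.get(max_score, 0)


toy_scores = [0, 0, 2, 3, 3, 3]  # illustrative data, not NAEP scores
for line in text_histogram(toy_scores, max_score=3):
    print(line)
print(extreme_counts(toy_scores, max_score=3))  # (2, 3)
```

Counting the cases at the floor and ceiling separately makes it easy to see whether extreme scores are frequent enough to warrant a closer look at individual cases.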






TABLE 1 



PERCENT OF TOTAL MATHEMATICS ACHIEVEMENT 
ITEMS IN EACH CONTENT AND PROCESS 
CATEGORY FOR 1977-78 AGE 17 DATA 



                                             Booklet Identification No. 
                                1    2   3H    4    5    6    7    8   9H   11  12C   10   Blueprint   All Books 

CONTENT CLASSES 

A. NUMBERS                     58   41   38   46   38   37   56   47   36   50   40   53         37%         45% 
B. VARIABLES                   13   18   16   22   20   18   11   14   19   15    4   14         21%         17% 
C. SHAPE                        7   20   24   18   22   11   11   14   20   15    0   19         16%         17% 
D. MEASUREMENT                 11   11   10    5    7   16    6    8   14   10    2    3         14%          9% 
E. OTHER TOPICS                11   10   12    9   13   19   17   18   11   11   54   10         12%         12% 

                              100  100  100  100  100  100  100  100  100  100  100  100        100%        100% 

PROCESS CLASSES 

I. & II. Knowledge & skills    60   56   64   72   67   56   70   54   67   66  100   58         38%         63% 
III. Understanding             18   23   21   11    7   19   15   20   17    8    0  [?]         24%         16% 
IV. Applications               22   21   16   17   25   25   14   26   16   26    0  [?]         37%         21% 

                              100  100  100  100  100  100  100  100  100  100  100  100        100%        100% 

N OF ITEMS                     55   61   58   65   55   57  [?]  [?]   64   62   48   58         430         654 






TABLE 2 

INTERCORRELATIONS AMONG MATHEMATICS 
REFINED SUBTESTS AND COMPOSITE TEST, 
1977 NAEP DATA SETS MATS01, MATS03, 
AND MATS10 POOLED (N=6,782) 



                              (1)   (2)   (3)   (4)   (5)   (6)   (7) 

Numbers (arithmetic)    (1) 
Variables (algebra)     (2)   .72 
Shapes (geometry)       (3)   .63   .60 
Knowledge/Skills        (4)   .93   .81   .74 
Understanding           (5)   .80   .74   .72   .77 
Application             (6)   .76   .67   .68   .72   .67 
Total                   (7)   .94   .85   .78   .97   .87   .84 
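
A lower-triangular matrix like Table 2 can be computed from an item-response matrix as sketched below. The toy responses, the grouping of items into subtests, and the function names are assumptions made for illustration. Because the total score contains every subtest's items, its correlations with the subtests include a part-whole overlap and will tend to exceed the correlations among the subtests themselves.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def subtest_scores(responses: List[List[int]],
                   groups: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Sum the 0/1 item scores belonging to each subtest, per respondent."""
    return {name: [sum(row[i] for i in idx) for row in responses]
            for name, idx in groups.items()}


def lower_triangle(scores: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """For each scale, its correlations with the scales listed before it."""
    names = list(scores)
    return {
        name: [round(pearson(scores[name], scores[names[c]]), 2)
               for c in range(r)]
        for r, name in enumerate(names)
    }


# Toy data: 5 respondents by 4 items (1 = correct). Not NAEP data.
responses = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0], [1, 1, 0, 0], [0, 1, 1, 1]]
groups = {"Numbers": [0, 1], "Shapes": [2, 3], "Total": [0, 1, 2, 3]}
table = lower_triangle(subtest_scores(responses, groups))
for name, row in table.items():
    print(f"{name:<8}", row)
# Numbers  []
# Shapes   [0.3]
# Total    [0.77, 0.84]
```

In the toy data the two subtests correlate only .30 with each other yet .77 and .84 with the total, which illustrates the part-whole effect on a small scale.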






TABLE 3 



FACTOR MATRIX (ROTATED) OF 31 SELECTED 
MATHEMATICS ITEMS FROM 1977-78 DATA SET MATS10 



                            FACTORS 

Item                1       2       3       4 

[Loadings for the items preceding E25c are illegible in the source.] 

E25c               .11     .31     .20     .42 
E26                .15     .49     .41     .24 
E27a               .10     .18     .17     .33 
E28                .34     .03     .09     .15 
E29                .33     .16     .12     .06 
E30                [?]     .42     .21     .36 
E31                .15     .46     .48     .09 
E32                .42     .26     .11     .23 
E33                .24     .37     .34     .27 
E36                .44     .08     .13     .24 
E38                .31     .13     .14     .13 
E39                .39     .11     .23     .13 

Eigenvalues        7.8     1.6     1.2     1.0 

Proportion of 
Total Variance     .25     .05     .04     .03 
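
The last two rows of Table 3 are linked by simple arithmetic, sketched below. The sketch assumes the factor analysis was run on the correlation matrix of the 31 items, so that total variance equals the number of items; the source does not state this. The second eigenvalue is read as 1.6 from a partly garbled entry.

```python
N_ITEMS = 31                          # items entering the factor analysis
EIGENVALUES = [7.8, 1.6, 1.2, 1.0]    # as read from Table 3 (1.6 is assumed)


def variance_proportions(eigenvalues, n_variables):
    """Share of total variance per factor, assuming standardized items
    (total variance = number of variables)."""
    return [round(e / n_variables, 2) for e in eigenvalues]


proportions = variance_proportions(EIGENVALUES, N_ITEMS)
print(proportions)                               # [0.25, 0.05, 0.04, 0.03]
print(round(sum(EIGENVALUES) / N_ITEMS, 2))      # 0.37, the four factors combined
```

Under that assumption the computed shares match the printed row, and the four factors together account for a little over a third of the total variance, most of it carried by the first factor.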






TABLE 4 



CONTENT CATEGORIZATION OF ITEMS FROM FACTOR ANALYSIS 
OF THE 31 REDUCED ITEM SET FOR MATHEMATICS, 
1977-78 NAEP DATA SET MATS10 



Factor 1                Factor 2                Factor 3                Factor 4 

Item                    Item                    Item                    Item 
#     Cont.  Process    #     Cont.  Process    #     Cont.  Process    #     Cont.  Process 

4b    N      K          19e   N      K          13a   O      A          4b    N      K 
8     N      A          20    O      A          14    S      K          6b    N      U 
12    [?]    A          21    V      K          15    N      A          7     V      U 
14    S      K          22a   V      A          18    O      U          9d    N      K 
19e   N      K          23    N      K          20    O      A          9a    N      K 
28    S      K          25c   N      K          21    V      K          25c   N      K 
29    O*     U          26    V      K          26    V      K          27a   N      K 
32    O      A          30    V      U          31    V      K          30    V      U 
36    N      K          33    S      K          33    S      K 
38    O      A 
39    S      U 

* "O" represents the "Other topics" content category (including "Measurement"). 

** Items were included in this table whenever they had factor loadings greater 
than .30. 
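
The inclusion rule in the second footnote can be expressed as a short routine. The sketch below applies it to four rows of Table 3 that are legible in the scanned copy; the function name is illustrative, and the comparison follows the footnote literally (loading greater than .30, with no absolute value taken).

```python
# Four legible rows of Table 3: item -> loadings on factors 1 through 4.
LOADINGS = {
    "E25c": (0.11, 0.31, 0.20, 0.42),
    "E26": (0.15, 0.49, 0.41, 0.24),
    "E27a": (0.10, 0.18, 0.17, 0.33),
    "E28": (0.34, 0.03, 0.09, 0.15),
}
THRESHOLD = 0.30


def items_by_factor(loadings, threshold):
    """Group items under every factor on which their loading exceeds
    the threshold; an item may therefore appear under several factors."""
    n_factors = len(next(iter(loadings.values())))
    groups = {f: [] for f in range(1, n_factors + 1)}
    for item, row in loadings.items():
        for f, value in enumerate(row, start=1):
            if value > threshold:
                groups[f].append(item)
    return groups


print(items_by_factor(LOADINGS, THRESHOLD))
# {1: ['E28'], 2: ['E25c', 'E26'], 3: ['E26'], 4: ['E25c', 'E27a']}
```

Because the rule is applied factor by factor, items such as E25c and E26 are listed twice, which is why several item numbers recur across the columns of Table 4.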






TABLE 5 

RELIABILITIES OF MATHEMATICS SUBTESTS AND COMPOSITE TEST FOR 
COMPLETE AND REFINED TESTS, 1977-78 NAEP 
DATA SET MATS01 (N=2294) 



                                 Complete Test                      Refined Test 

                          N                  Average         N                  Average 
                          of   Reliability   Percent         of   Reliability   Percent 
                        Items    (Alpha)     Correct       Items    (Alpha)     Correct 

Numbers (arithmetic)      32       .88         .62           26       .88         .54 
Variables (algebra)        7       .60         .57            6       .61         .61 
Shape (geometry)           4       .40         .45            3       .33         .53 
Measurement                6       .65         .64            6       .65         .64 
Other                      6       .48         .58            4       .49         .61 

Knowledge/Skills          33       .87         .66           22       .84         .67 
Understanding             10       .69         .56            8       .70         .56 
Applications              12       .65         .47            8       .65         .46 

Total Test                55       .92         .60           45       .92         .57 
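
The reliability and percent-correct columns can be reproduced from a matrix of scored responses, as sketched below with coefficient alpha. The toy data and function names are illustrative. The alpha-if-item-deleted helper is one common way to examine the contribution of individual items; the source does not say which criterion was used to form the refined tests, so that helper is an assumption.

```python
from statistics import pvariance
from typing import List


def cronbach_alpha(responses: List[List[int]]) -> float:
    """Coefficient alpha for a respondents-by-items matrix of item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    total_var = pvariance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores have no variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def percent_correct(responses: List[List[int]]) -> float:
    """Average proportion of items answered correctly."""
    n_cells = len(responses) * len(responses[0])
    return sum(sum(row) for row in responses) / n_cells


def alpha_if_item_deleted(responses: List[List[int]]) -> List[float]:
    """Alpha recomputed with each item removed in turn."""
    k = len(responses[0])
    return [cronbach_alpha([row[:i] + row[i + 1:] for row in responses])
            for i in range(k)]


# Toy data: 5 respondents by 4 items (1 = correct). Not NAEP data.
responses = [[1, 1, 1, 1], [1, 0, 1, 0], [0, 0, 0, 0], [1, 1, 0, 0], [0, 1, 1, 1]]
print(round(cronbach_alpha(responses), 2))   # 0.61
print(round(percent_correct(responses), 2))  # 0.55
print([round(a, 2) for a in alpha_if_item_deleted(responses)])
```

An item whose removal raises alpha is a candidate for exclusion, although with subtests as short as three or four items any such estimate is unstable.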






TABLE 6 



RELIABILITIES OF MATHEMATICS SUBTESTS AND COMPOSITE TEST FOR 
COMPLETE AND REFINED TESTS, 1977-78 NAEP 
DATA SET MATS03 (N=2272) 



                                 Complete Test                      Refined Test 

                          N                  Average         N                  Average 
                          of   Reliability   Percent         of   Reliability   Percent 
                        Items    (Alpha)     Correct       Items    (Alpha)     Correct 

Numbers (arithmetic)      22       .85         .62           20       .85         .60 
Variables (algebra)      [?]       .76         .30          [?]       .76         .34 
Shape (geometry)          14       .84         .51           14       .84         .51 
Measurement              [?]       .59         .59          [?]       .58         .42 
Other                    [?]       .53         .63          [?]       .53         .63 

Knowledge/Skills          36       .90         .50           32       .90         .50 
Understanding             12       .76         .59           12       .76         .59 
Applications             [?]       .64         .48          [?]       .63         .53 

Total Test                58       .93         .54           53       .93         .53 






TABLE 7 

RELIABILITIES OF MATHEMATICS SUBTESTS AND COMPOSITE TEST FOR 
COMPLETE AND REFINED TESTS, 1977-78 NAEP 
DATA SET MATS10 (N=2216) 



                                 Complete Test                      Refined Test 

                          N                  Average         N                  Average 
                          of   Reliability   Percent         of   Reliability   Percent 
                        Items    (Alpha)     Correct       Items    (Alpha)     Correct 

Numbers (arithmetic)      31       .89         .68           25       .89         .71 
Variables (algebra)        8       .79         .49            7       .80         .43 
Shape (geometry)          11       .72         .53            9       .72         .58 
Measurement                2       .37         .49            2       .37         .49 
Other                      6       .52         .46            4       .48         .50 

Knowledge/Skills          33       .91         .70           27       .90         .67 
Understanding             10       .72         .69            9       .70         .66 
Applications              14       .75         .40           10       .75         .48 

Total Test                58       .93         .62           47       .93         .61 








